Programming Massively Parallel Processors: A Practice-Oriented Course: The CUDA Execution Model: Host vs. Device

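The CUDA execution model turns your computer into an efficient heterogeneous system. Picture a commander-in-chief (the host, or CPU) and an army of a thousand soldiers (the device, or GPU). The commander handles complex logic and decision-making, while the army carries out massive repetitive tasks simultaneously.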

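1. Structural Differences

The host is a latency-optimized central processing unit (CPU), designed for complex control flow and serial tasks. The device, in contrast, is a throughput-optimized graphics processing unit (GPU) containing thousands of simple cores that can execute the same instruction across a large dataset at the same time.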

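2. Execution Rhythm

A CUDA program runs as a sequence of phases. Execution starts on the host, which handles the serial code. When the program reaches a parallel kernel, it launches a grid of threads on the device. Once the device finishes its large workload, control returns to the host.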
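The phases described above can be sketched as a small vector-addition program. This is a minimal illustration, not the course's own code: the names `vecAdd` and `runVecAdd`, the block size of 256, and the input values are all assumptions chosen for the example, and error checking is kept to a minimum.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Parallel kernel: runs on the device, one thread per output element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard: the last block may be partly idle
}

// Hypothetical helper: returns true if the device result matches the host's.
bool runVecAdd(int n) {
    size_t size = n * sizeof(float);

    // Phase 1 (host, serial): prepare the input data.
    float *a = (float *)malloc(size);
    float *b = (float *)malloc(size);
    float *c = (float *)malloc(size);
    for (int i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f * (float)i; }

    // Phase 2 (device, parallel): copy inputs over and launch a grid of threads.
    float *da, *db, *dc;
    cudaMalloc((void **)&da, size);
    cudaMalloc((void **)&db, size);
    cudaMalloc((void **)&dc, size);
    cudaMemcpy(da, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, size, cudaMemcpyHostToDevice);

    int threads = 256;                         // assumed block size
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    // Phase 3 (host, serial): this copy waits for the kernel to finish,
    // after which the host continues with the result.
    cudaError_t err = cudaMemcpy(c, dc, size, cudaMemcpyDeviceToHost);

    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (c[i] == a[i] + b[i]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(a); free(b); free(c);
    return ok;
}

int main() {
    printf("%s\n", runVecAdd(1 << 20) ? "match" : "mismatch");
    return 0;
}
```

The kernel launch itself returns to the host immediately. In this sketch it is the blocking `cudaMemcpy` back to the host that marks the point where control effectively returns with the finished result.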

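3. Performance Specialization

This model exploits the strengths of both processors: the CPU manages system resources and complex branching, while the GPU runs SPMD (single program, multiple data) logic to process data elements in parallel.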
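To make the SPMD idea concrete, here is a short sketch in which every thread runs the same kernel body and only its computed index, and therefore its data element, differs. The names `squareAll`, `blocksFor`, and `runSquare` are hypothetical. The example also assumes a device and toolkit that support unified memory through `cudaMallocManaged`, which older hardware does not.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Single program: every thread executes this same code.
// Multiple data: each thread derives a different index i from its IDs.
__global__ void squareAll(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

// Hypothetical helper: number of blocks needed to cover n elements
// (ceiling division).
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: squares 0..n-1 on the device and checks the result.
bool runSquare(int n) {
    float *data = nullptr;
    // Assumption: unified memory is available, so host and device share
    // one pointer and no explicit copies are needed.
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) data[i] = (float)i;   // host fills the input

    squareAll<<<blocksFor(n, 256), 256>>>(data, n);   // device does the work
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < n; ++i) ok = (data[i] == (float)i * (float)i);
    cudaFree(data);
    return ok;
}

int main() {
    printf("%s\n", runSquare(1000) ? "match" : "mismatch");
    return 0;
}
```

The host keeps the branching-heavy work (allocation, error handling, verification), while the kernel body stays free of data-dependent control flow apart from the bounds guard.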

QUESTION 1

Which architecture is characterized as being 'throughput-optimized'?

The Host (Intel® CPU)

The Device (NVIDIA® GPU)

The System RAM

The PCIe Bus

QUESTION 2

The reader should complete Part 1 of the MatrixMultiplication() example in Figure 3.6 with similar declarations of an Nd and a Pd pointer variable as well as their corresponding cudaMalloc() calls. Furthermore, Part 3 in Figure 3.6 can be completed with mandatory calls. Which code fragment does this correctly?

float *Nd, *Pd; cudaMalloc((void**)&Nd, size); ... cudaFree(Nd);

float Nd, Pd; malloc(&Nd, size); ... free(Nd);

float *Nd, *Pd; cudaMemcpy(Nd, Pd, size); ... delete Nd;

int Nd, Pd; Nd = new float[size]; ... free(Nd);

QUESTION 3

In the CUDA execution model, where does a program always begin its execution?

On the Device (GPU)

Simultaneously on both

On the Host (CPU)

In the Global Memory

QUESTION 4

What happens when the Host encounters a phase with rich data parallelism?

It speeds up its clock frequency.

It launches a Kernel onto the Device.

It stores the data in the Host Cache.

It converts the code to Python.

QUESTION 5

A student attempts to launch a 1024x1024 matrix multiplication on G80 hardware using 1024 blocks, where each thread calculates one element. Why will this fail?

The G80 cannot handle 1024 blocks.

The total number of threads exceeds 1 million.

The configuration results in 1024 threads per block, exceeding the 512 hardware limit.

Matrix multiplication is not data parallel.